Boruta is a machine-learning algorithm, more specifically a feature-selection algorithm. As presented in the original paper describing it, its aim is to find ''all relevant'' features (as opposed to a ''minimal-optimal'' feature set). Boruta is not a stand-alone algorithm, but is implemented as a wrapper around the random-forest classification algorithm. In essence, Boruta works iteratively: in each iteration it removes features which, according to a statistical test, are less relevant than what the authors define as a ''random probe''.

One of the fundamental components of Boruta is the use of ''shadow attributes''. ''Shadow attributes'' are pseudo-features added to the information system, produced by copying existing features of the original data set and shuffling each copy's values across the samples (data points). After generating the ''shadow attributes'', the procedure builds random-forest trees and compares the Z-scores obtained by the original features with the Z-scores obtained by the ''shadow attributes''. This comparison is the basis on which Boruta decides whether a feature is important or not.

High-level pseudo-code:
1. Extend the information system with a shuffled copy of every feature (the ''shadow attributes'')
2. Run random forest on the extended system and gather the Z-scores
3. Find the MZSA (maximum Z-score among the ''shadow attributes'')
4. Assign each original feature a hit if its Z-score > MZSA
5. For each feature whose importance is still undetermined, perform a two-sided equality test against the MZSA
6. If a feature's Z-score is significantly below the MZSA, drop the feature as unimportant
7. If a feature's Z-score is significantly above the MZSA, keep the feature as important
8. Remove the ''shadow attributes'' and repeat from step 1 until the importance of all features is determined or the maximum number of random-forest runs has been reached

==References==
Source: the free encyclopedia Wikipedia, article "Boruta (algorithm)".
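The iterative loop above can be sketched in Python. This is a minimal illustration of the shadow-attribute mechanism, not the reference implementation: to keep it self-contained it uses absolute Pearson correlation as a stand-in for the random-forest Z-scores, keeps rejected features in the system rather than removing them, and the function and parameter names are hypothetical.

```python
import numpy as np
from math import comb

def importance(X, y):
    # Stand-in for the random-forest Z-scores used by real Boruta:
    # absolute Pearson correlation of each column with the target.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc**2).sum(axis=0) * (yc**2).sum())
    return np.abs(Xc.T @ yc) / np.maximum(denom, 1e-12)

def binom_two_sided_p(hits, n):
    # Two-sided binomial test of `hits` successes in `n` trials
    # against p = 0.5 (stdlib only, exact tail sums).
    p_le = sum(comb(n, k) for k in range(0, hits + 1)) / 2**n
    p_ge = sum(comb(n, k) for k in range(hits, n + 1)) / 2**n
    return min(1.0, 2 * min(p_le, p_ge))

def boruta_sketch(X, y, max_iter=50, alpha=0.05, rng=None):
    rng = np.random.default_rng(rng)
    n_feat = X.shape[1]
    status = np.zeros(n_feat, dtype=int)  # 0 undecided, 1 important, -1 unimportant
    hits = np.zeros(n_feat, dtype=int)
    for it in range(1, max_iter + 1):
        # Extend the system with a shuffled copy of every feature
        # (the "shadow attributes").
        shadows = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(n_feat)])
        # Score originals and shadows together.
        scores = importance(np.hstack([X, shadows]), y)
        # MZSA: maximum score among the shadow attributes.
        mzsa = scores[n_feat:].max()
        # A feature gets a hit when it beats the MZSA.
        hits += (scores[:n_feat] > mzsa).astype(int)
        # Two-sided test of each undecided feature's hit count;
        # real Boruta would also remove rejected features permanently.
        for j in np.where(status == 0)[0]:
            if binom_two_sided_p(int(hits[j]), it) < alpha:
                status[j] = 1 if hits[j] > it / 2 else -1
        if (status != 0).all():
            break
    return status

# Tiny demo: feature 0 drives y, features 1 and 2 are noise.
demo_rng = np.random.default_rng(0)
X = demo_rng.normal(size=(200, 3))
y = X[:, 0] * 2.0 + 0.1 * demo_rng.normal(size=200)
status = boruta_sketch(X, y, rng=1)
```

In this demo the informative feature beats the MZSA in essentially every iteration, so its hit count quickly becomes significant, while the noise features behave like the shadows and drift toward rejection.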